ABSTRACT

The IEEE 802.16d communication standard uses orthogonal frequency division multiplexing (OFDM). In widely used OFDM
systems, the fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT) pair is used to modulate and demodulate
the data on the sub-carriers. This paper presents a high-level implementation of a high-performance FFT for an OFDM
modulator and demodulator. The design is coded in Verilog and targeted at Xilinx Spartan3 field programmable gate arrays.
The radix-2² algorithm is proposed and used for the OFDM communication system. The FFT design is implemented and applied
to the fixed IEEE 802.16d communication standard. The results are tabulated and the hardware parameters are compared.
Among the compared architectures, the proposed one uses the fewest multipliers and the smallest memory, and the
second-fewest adders.


Keywords - Fast Fourier Transform, Orthogonal Frequency Division Multiplexing, Radix Conversion, Verilog. 


1. INTRODUCTION 


OFDM technology is used in many communication systems, such as Asymmetric Digital Subscriber Line (ADSL),
Wireless Local Area Network (WLAN) and multimedia communication services (T. Lenart et al. 2004). One of the key
components of an OFDM system is the Fast Fourier Transform (FFT). A growing number of communication systems
require larger FFT sizes and higher symbol rates, which makes low-power, high-speed FFT design for large point
counts challenging. In the target application, the IEEE 802.16d (fixed WiMAX) standard requires OFDM symbol rates
from 1.75 MHz to 20 MHz and an FFT of up to 2048 points (T. Lenart et al. 2004).

There are, in general, two approaches to implementing the FFT for OFDM processing: pipeline processing and
memory-based recursive processing. In the search for a high-performance FFT, this paper presents a pipelined
architecture called R2²SDF, implemented on Field Programmable Gate Arrays (FPGAs) using the Verilog Hardware
Description Language (HDL). The next section describes the architecture and design methodology, followed by the
implementation in Verilog and its utilization, performance and use in OFDM systems. Finally, the paper concludes
with a comparison of the hardware requirements of R2²SDF and several other popular pipeline architectures.

The proposed work is based on a memory-based recursive FFT design, which has a much lower gate count, lower
power consumption and higher speed. Its speed easily satisfies most application requirements of the IEEE 802.16d
communication standard, which uses an OFDMA-modulated wireless communication system. The design uses fewer
gates and hence has lower cost and power consumption. WiMAX profiles based on 802.16-2004 are better suited to
fixed applications that use directional antennae because OFDM is inherently less complex than SOFDMA. As a result,
802.16-2004 networks may be deployed faster and at a lower cost. In addition, 802.16-2004 WiMAX Forum CERTIFIED
products will be available earlier and will be adopted by service providers that plan to deploy a network in the
near future. OFDMA gives 802.16e profiles more flexibility when managing different user devices with a variety of
antenna types and form factors. It reduces interference for user devices with omnidirectional antennas and improves
the NLOS capabilities that are essential when supporting mobile subscribers (H. Lin et al. 2008). Subchannelization
defines subchannels that can be allocated to different subscribers depending on the channel conditions and their
data requirements. This gives the operator more flexibility in managing bandwidth and transmit power, and leads to
a more efficient use of resources.


2. ARCHITECTURE AND DESIGN METHODOLOGY 

2.1. Radix-2² Decimation in Frequency FFT Algorithm

A useful state-of-the-art review of hardware architectures for FFTs was given in F. Wakerly et al. (2006), where
different approaches were put into functional blocks with unified terminology. From the definition of the DFT of
size N (Y. Zhang et al. 2007),


X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad 0 \le k < N    (1)


where W_N denotes the primitive Nth root of unity, with its exponent evaluated modulo N, x(n) is the input sequence
and X(k) is the DFT. F. Wakerly et al. (2006) applied a 3-dimensional linear index map,


n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right\rangle_N

k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N    (2)


and common factor algorithm (CFA) to derive a set of 4 DFTs of length N/4 as 


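X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3) W_N^{n_3(k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}    (3)


where n1, n2, n3 are the index terms of the input sample n, k1, k2, k3 are the index terms of the output sample k, and
H(k1, k2, n3) is expressed in (4).


H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\left(n_3 + \frac{N}{2}\right) \right] + (-j)^{(k_1 + 2k_2)} \left[ x\left(n_3 + \frac{N}{4}\right) + (-1)^{k_1} x\left(n_3 + \frac{3N}{4}\right) \right]    (4)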


Equation (4) represents the first two stages of butterflies, with only trivial multiplications in the signal flow graph
(SFG), as Butterfly I (BF2I) and Butterfly II (BF2II). Full multipliers are required after the two butterflies in order to
compute the product with the decomposed twiddle factor W_N^{n_3(k_1 + 2k_2)} in (3). Note that the order of the twiddle
factors differs from that of the radix-4 algorithm. Applying this CFA procedure recursively to the remaining DFTs of
length N/4 in (3) yields the complete radix-2² decimation-in-frequency (DIF) FFT algorithm. The corresponding FFT flow
graph for N = 16 is shown in Figure 1.
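
The recursion in (3) and (4) can be checked numerically with a short software model. The following Python sketch is only an illustration and not the Verilog design: the function names `dft` and `fft_radix22` are hypothetical, the input length is assumed to be a power of 4, and the output is written in natural order rather than the bit-reversed order produced by the hardware pipeline.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT from equation (1), used only as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT following (3) and (4).

    Assumption of this sketch: len(x) is a power of 4.
    """
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 4 != 0:
        raise ValueError("length must be a power of 4 in this sketch")
    q = n_pts // 4
    out = [0j] * n_pts
    for k1 in (0, 1):
        for k2 in (0, 1):
            sub = []
            for n3 in range(q):
                # Equation (4): both butterflies need only +/-1 and -j factors.
                a = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                b = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                h = a + (-1j) ** (k1 + 2 * k2) * b
                # Full multiplication by the twiddle factor W_N^{n3 (k1 + 2 k2)}.
                w = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n_pts)
                sub.append(h * w)
            # Remaining DFT of length N/4 over n3, indexed by k3.
            sub_dft = fft_radix22(sub)
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub_dft[k3]
    return out
```

For a 16-point input, the result of `fft_radix22` agrees with the direct `dft` to within floating-point rounding, which is a quick way to confirm that the reconstructed index map and twiddle exponents are mutually consistent.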


2.2. Radix-2² FFT Architecture

Mapping the radix-2² DIF FFT algorithm derived above onto the radix-2 SDF architecture yields the new R2²SDF
architecture (Sankhsawas et al. 2006). Figure 1 outlines an implementation of the R2²SDF architecture for N = 1024;
note the similarity of the data path to R2SDF and the reduced number of multipliers (Harikrishna et al. 2009). The
implementation uses two types of butterflies: one identical to that in R2SDF, and the other also containing the logic
to implement the trivial twiddle factor multiplication, as shown in Figure 2 (a) and (b) respectively (Sankhsawas et
al. 2006).

Figure 1: R2²SDF pipeline FFT architecture for N = 1024

Figure 2: Butterfly structure for the R2²SDF FFT

Due to the spatial regularity of the radix-2² algorithm, the synchronization control of the processor is very simple.
A (log2 N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor
reading in each stage. With the help of the butterfly structures shown in Figure 2, the scheduled operation of the
R2²SDF processor in Figure 1 is as follows. During the first N/2 cycles, the 2-to-1 multiplexers in the first
butterfly module switch to position "0" and the butterfly is idle. The input data from the left is directed to the
shift registers until they are filled. During the next N/2 cycles, the multiplexers turn to position "1", and the
butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.


Z_1(n) = x(n) + x(n + N/2)
Z_1(n + N/2) = x(n) - x(n + N/2), \quad 0 \le n < N/2    (5)


The butterfly outputs Z1(n) and Z1(n+N/2) are computed according to (5). Z1(n) is sent on to apply the twiddle
factors, and Z1(n+N/2) is sent back to the shift registers to be "multiplied" in the next N/2 cycles, when the first
half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of
the first, except that the "distance" of the butterfly input sequence is just N/4 and the trivial twiddle factor
multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as
shown in Figure 2 (a) and (b), which requires a two-bit control signal from the synchronizing counter. The data then
goes through a full complex multiplier working at 75% utilization, which completes the first level of the radix-4 DFT
word by word. Further processing repeats this pattern, with the distance of the input data halving at each
consecutive butterfly stage. After N-1 clock cycles, the result of the complete DFT streams out to the right, in
bit-reversed order. The next frame can be transformed without pausing because each stage is pipelined.


Table 1: Implementation result

Logic utilization          Available    Utilization (%)
No. of slices
No. of slice flip flops
No. of 4 input LUTs
No. of bonded IOBs
No. of 8x8 Multiplxers
No. of GCLKs


Table 2: Timing summary

10.827 (ns)


Figure 3: Xilinx Spartan 3 FPGA kit


Table 3: Hardware requirement comparison

Multiplier #    Adder #    Memory size


Figure 4: 8-point FFT simulation result
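
The schedule of the first butterfly stage described above can be modelled cycle by cycle. The Python sketch below is a behavioural illustration under stated assumptions, not the Verilog source: the name `sdf_stage` is hypothetical, the feedback shift registers are assumed to start at zero, one sample enters per clock cycle, and twiddle multiplication is left out.

```python
from collections import deque


def sdf_stage(samples, delay):
    """Simulate one BF2I delay-feedback stage, one sample per clock cycle.

    `delay` is the feedback shift-register length (N/2 for the first stage).
    The output stream lags the input stream by `delay` cycles.
    """
    fifo = deque([0] * delay)  # feedback shift registers, assumed to start at zero
    out = []
    for cycle, x_in in enumerate(samples):
        stored = fifo.popleft()  # oldest word leaves the shift register
        if (cycle // delay) % 2 == 0:
            # Mux position 0: butterfly idle, input is stored, old word streams out.
            out.append(stored)
            fifo.append(x_in)
        else:
            # Mux position 1: 2-point DFT of stored x(n) and incoming x(n + N/2).
            out.append(stored + x_in)   # Z1(n), forwarded to the next block
            fifo.append(stored - x_in)  # Z1(n + N/2), fed back to the registers
    return out
```

Feeding the frame 1, 2, ..., 8 followed by four flushing zeros with `delay = 4` produces four start-up zeros, then the sums 6, 8, 10, 12, and then the differences -4, -4, -4, -4, which leave the shift registers while the next frame would be loading. This matches the behaviour of (5) for N = 8.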


3. IMPLEMENTATION USING VERILOG 


The R2²SDF presented above has been fully coded in Verilog HDL. Once the design is coded in Verilog, the ModelSim
XE III 6.2c compiler and the Xilinx Foundation ISE 9.1i environment generate a net-list for FPGA configuration. The
net-list can then be downloaded into the FPGA using the same Xilinx tools and a Texas Instruments prototyping board.
From the architecture of R2²SDF in Figure 1, the butterfly blocks BF2I and BF2II are described as building blocks in
Verilog code. The Booth multiplication algorithm for signed binary numbers is used for the complex multipliers. Thus,
the overall latency of the real implementation varies as the processing word length changes. Look-up-table (LUT)
based random access memories (RAMs) and flip-flops are used to implement the feedback memory of the very last stages,
whereas the RAM blocks in the FPGA are used for the rest of the stages. Similarly, LUT-based read only memories
(ROMs) are used to implement the twiddle ROMs of the very last stages, whereas block RAMs are used for the rest of
the stages (F. Wakerly et al. 2006). The FFT is heavily pipelined to achieve the highest possible clock frequency.
Twiddle factors are generated by an external program and embedded in the Verilog code. The results after
implementation on the Xilinx Spartan3 FPGA (Figure 3) are listed in Table 1 and Table 2: Table 1 shows the
implementation results and Table 2 the timing summary. The resulting figures show that our implementation
outperforms the other implementations of its kind. Its speed nearly matches that of the Xilinx core, but its
throughput is more than 3 times higher due to its pipelined nature.
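
The external twiddle-factor program is not described further. As a rough illustration of what such a generator might look like, the Python sketch below tabulates W_N^k as signed fixed-point words. The 16-bit word length, the scaling to 2^15 - 1, the hex text format (of the kind that Verilog's `$readmemh` can load) and the names `twiddle_rom` and `to_hex_lines` are all assumptions of this sketch. The sketch lists W_N^k in natural order and does not reproduce the per-stage read sequence of the R2²SDF pipeline.

```python
import math


def twiddle_rom(n_points, width=16):
    """Return (real, imag) two's-complement words for W_N^k, k = 0..N-1.

    W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N), scaled to the largest positive
    value representable in `width` bits (an assumed fixed-point format).
    """
    scale = (1 << (width - 1)) - 1
    mask = (1 << width) - 1
    rom = []
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        re = round(math.cos(angle) * scale)
        im = round(-math.sin(angle) * scale)
        # Masking wraps negative integers to their two's-complement encoding.
        rom.append((re & mask, im & mask))
    return rom


def to_hex_lines(rom, width=16):
    """Format each ROM word as real part followed by imaginary part, in hex."""
    digits = (width + 3) // 4
    return [f"{re:0{digits}x}{im:0{digits}x}" for re, im in rom]
```

Calling `to_hex_lines(twiddle_rom(8))` gives one text line per ROM address, which could be pasted into a ROM initialisation file or a case statement.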


4. PERFORMANCE AND IMPLEMENTATION 

4.1. Hardware Requirement 

The radix-4 butterfly needs 3 complex adders and 1 complex multiplier, while the proposed butterfly structure needs
4 complex adders and 1 complex multiplier. This is because our design implements the constant multiplier with 4
reused complex adders. Figure 3 also shows the radix-8 butterfly and the radix-4 butterfly. All of the
above-mentioned designs separate a single static random access memory (SRAM) into 2 smaller SRAMs. This arrangement
can double SRAM throughput with interleaved access. In Table 3, the hardware requirement of the proposed design is
compared with various pipelined designs.



4.2. Power Consumption 

Power consumption is estimated from the number of data transitions, which is proportional to the number of SRAM
accesses. Here we assume that the adders and multipliers are active in every clock cycle because of the pipelined
architecture. The more SRAM accesses there are, the higher the power consumption. The number of SRAM accesses is
linear in the number of recursive iterations of the FFT, as described in (6). The SRAM is accessed twice in each
clock cycle, so (6) includes a factor of 2.

SRAM access times = N × (iteration times) × 2    (6)

Comparing SRAM access times against the FFT size N shows that the proposed design requires 20% to 40% fewer accesses
than the radix-4 FFT. Therefore, the proposed architecture consumes much less power.
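
Equation (6) is simple enough to evaluate directly. In the Python sketch below, the iteration count is taken to be the number of radix-r passes needed to cover N points. That reading, and the function names, are assumptions of this sketch, since the text does not state how the iteration count of each design is obtained.

```python
def iterations(n_points, radix):
    """Number of butterfly passes for an N-point FFT at the given radix.

    Assumption: the smallest count such that radix**count >= n_points.
    """
    count, size = 0, 1
    while size < n_points:
        size *= radix
        count += 1
    return count


def sram_accesses(n_points, radix):
    """Equation (6): N x (iteration times) x 2."""
    return n_points * iterations(n_points, radix) * 2
```

As an illustration, for N = 4096 a radix-4 schedule needs 6 passes and 49152 accesses, whereas a radix-8 schedule needs 4 passes and 32768 accesses, about one third fewer. The radix-8 figure is used here only as an example of a higher-radix schedule and is not a statement about the proposed design.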


4.3. Implementation 

The Fujitsu WiMAX SoC, MB87M3550, fully complies with the IEEE 802.16d standard using OFDM. The system-on-chip
(SoC) can operate in all available channel bandwidths from 1.75 MHz up to 20 MHz (C. H. Su et al. 2009). This SoC is
designed to support frequencies from 2 GHz to 11 GHz in both licensed and license-exempt bands. A programmable
frequency selection generates the sample clock for any desired bandwidth. Uplink subchannelization is supported as
defined in the standard. The SoC's integrated ARM-926 RISC engine implements the 802.16 upper layer MAC, scheduler,
drivers, protocol stacks, and user application software. A multi-channel DMA controller handles high-speed
transactions among various agents on a high-performance bus. To offload processing from the upper layer MAC and
enhance performance, the Fujitsu WiMAX SoC includes a separate ARC RISC/DSP engine to execute 802.16 lower layer MAC
functions. The chip's multiple hardware-based encryption/decryption engines are tightly coupled with this lower
layer MAC processor.


5. CONCLUSION 

We have proposed a memory-based recursive FFT design that has a much lower gate count, lower power consumption and
higher speed. The proposed architecture has three main advantages: 1) fewer butterfly iterations, which reduces
power consumption; 2) pipelining of the radix-2² butterfly, which raises the clock frequency; and 3) even
distribution of memory accesses, which makes efficient use of the SRAM ports. The simulation result of the 8-point
FFT code is shown in Figure 4.